Abstract

Training Deep Neural Networks is complicated by the fact that the
distribution of each layer’s inputs changes during training, as the
parameters of the previous layers change. This slows down the training
by requiring lower learning rates and careful parameter initialization,
and makes it notoriously hard to train models with saturating
nonlinearities. We refer to this phenomenon as internal covariate
shift, and address the problem by normalizing layer inputs. Our method
draws its strength from making normalization a part of the model
architecture and performing the normalization for each training
mini-batch. Batch Normalization allows us to use much higher learning
rates and be less careful about initialization. It also acts as a
regularizer, in some cases eliminating the need for Dropout. Applied to
a state-of-the-art image classification model, Batch Normalization
achieves the same accuracy with 14 times fewer training steps, and
beats the original model by a significant margin. Using an ensemble of
batch-normalized networks, we improve upon the best published result on
ImageNet classification: reaching 4.9% top-5 validation error (and 4.8%
test error), exceeding the accuracy of human raters.

1 Introduction 


Deep learning has dramatically advanced the state of the art in vision,
speech, and many other areas. Stochastic gradient descent (SGD) has
proved to be an effective way of training deep networks, and SGD
variants such as momentum (Sutskever et al., 2013) and Adagrad
(Duchi et al., 2011) have been used to achieve state of the art
performance. SGD optimizes the parameters $\Theta$ of the network, so
as to minimize the loss

$$\Theta = \arg\min_{\Theta} \frac{1}{N} \sum_{i=1}^{N} \ell(x_i, \Theta)$$

where $x_{1 \ldots N}$ is the training data set. With SGD, the training
proceeds in steps, and at each step we consider a mini-batch
$x_{1 \ldots m}$ of size $m$. The mini-batch is used to approximate the
gradient of the loss function with respect to the parameters, by
computing

$$\frac{1}{m} \sum_{i=1}^{m} \frac{\partial \ell(x_i, \Theta)}{\partial \Theta}.$$

Using mini-batches of examples, as opposed to one example at a time, is
helpful in several ways. First, the gradient of the loss over a
mini-batch is an estimate of the gradient over the training set, whose
quality improves as the batch size increases. Second, computation over
a batch can be much more efficient than $m$ computations for individual
examples, due to the parallelism afforded by the modern computing
platforms.

While stochastic gradient is simple and effective, it requires careful
tuning of the model hyper-parameters, specifically the learning rate
used in optimization, as well as the initial values for the model
parameters. The training is complicated by the fact that the inputs to
each layer are affected by the parameters of all preceding layers - so
that small changes to the network parameters amplify as the network
becomes deeper.

The change in the distributions of layers’ inputs presents a problem
because the layers need to continuously adapt to the new distribution.
When the input distribution to a learning system changes, it is said to
experience covariate shift (Shimodaira, 2000). This is typically
handled via domain adaptation (Jiang, 2008). However, the notion of
covariate shift can be extended beyond the learning system as a whole,
to apply to its parts, such as a sub-network or a layer. Consider a
network computing

$$\ell = F_2(F_1(u, \Theta_1), \Theta_2)$$

where $F_1$ and $F_2$ are arbitrary transformations, and the parameters
$\Theta_1, \Theta_2$ are to be learned so as to minimize the loss
$\ell$. Learning $\Theta_2$ can be viewed as if the inputs
$x = F_1(u, \Theta_1)$ are fed into the sub-network

$$\ell = F_2(x, \Theta_2).$$

For example, a gradient descent step

$$\Theta_2 \leftarrow \Theta_2 - \frac{\alpha}{m} \sum_{i=1}^{m} \frac{\partial F_2(x_i, \Theta_2)}{\partial \Theta_2}$$

(for batch size $m$ and learning rate $\alpha$) is exactly equivalent to
that for a stand-alone network $F_2$ with input $x$. Therefore, the
input distribution properties that make training more efficient - such
as having the same distribution between the training and test data -
apply to training the sub-network as well. As such it is advantageous
for the distribution of $x$ to remain fixed over time. Then, $\Theta_2$
does not have to readjust to compensate for the change in the
distribution of $x$.
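
The mini-batch gradient estimate used by SGD in this section can be illustrated with a small sketch. The least-squares loss, the synthetic data, and the names `grad`, `full`, and `avg_of_batches` are assumptions made for illustration only; they do not come from the paper. The sketch checks that averaging the mini-batch gradients over a partition of the data into equal-sized batches reproduces the full-data gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, m = 1000, 3, 50                      # data set size, input dim, mini-batch size
X = rng.normal(size=(N, d))
t = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=N)
theta = np.zeros(d)                        # current parameters (hypothetical model)

def grad(Xb, tb, theta):
    """Gradient of the loss 0.5 * (x . theta - t)^2, averaged over the rows of Xb."""
    return Xb.T @ (Xb @ theta - tb) / len(tb)

full = grad(X, t, theta)                   # gradient over the whole training set
batch_grads = [grad(X[i:i + m], t[i:i + m], theta) for i in range(0, N, m)]
avg_of_batches = np.mean(batch_grads, axis=0)

# Each mini-batch gradient is a noisy estimate of `full`; averaged over an
# equal-sized partition of the data they agree with it.
print(np.allclose(full, avg_of_batches))
```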


Fixed distribution of inputs to a sub-network would have positive
consequences for the layers outside the sub-network, as well. Consider
a layer with a sigmoid activation function $z = g(Wu + b)$ where $u$ is
the layer input, the weight matrix $W$ and bias vector $b$ are the
layer parameters to be learned, and $g(x) = \frac{1}{1 + \exp(-x)}$. As
$|x|$ increases, $g'(x)$ tends to zero. This means that for all
dimensions of $x = Wu + b$ except those with small absolute values, the
gradient flowing down to $u$ will vanish and the model will train
slowly. However, since $x$ is affected by $W$, $b$ and the parameters
of all the layers below, changes to those parameters during training
will likely move many dimensions of $x$ into the saturated regime of
the nonlinearity and slow down the convergence. This effect is
amplified as the network depth increases. In practice, the saturation
problem and the resulting vanishing gradients are usually addressed by
using Rectified Linear Units (Nair & Hinton, 2010)
$\mathrm{ReLU}(x) = \max(x, 0)$, careful initialization
(Bengio & Glorot, 2010; Saxe et al., 2013), and small learning rates.
If, however, we could ensure that the distribution of nonlinearity
inputs remains more stable as the network trains, then the optimizer
would be less likely to get stuck in the saturated regime, and the
training would accelerate.

We refer to the change in the distributions of internal nodes of a deep
network, in the course of training, as Internal Covariate Shift.
Eliminating it offers a promise of faster training. We propose a new
mechanism, which we call Batch Normalization, that takes a step towards
reducing internal covariate shift, and in doing so dramatically
accelerates the training of deep neural nets. It accomplishes this via
a normalization step that fixes the means and variances of layer
inputs. Batch Normalization also has a beneficial effect on the
gradient flow through the network, by reducing the dependence of
gradients on the scale of the parameters or of their initial values.
This allows us to use much higher learning rates without the risk of
divergence. Furthermore, batch normalization regularizes the model and
reduces the need for Dropout (Srivastava et al., 2014). Finally, Batch
Normalization makes it possible to use saturating nonlinearities by
preventing the network from getting stuck in the saturated modes.

In Sec. 4.2, we apply Batch Normalization to the best-performing
ImageNet classification network, and show that we can match its
performance using only 7% of the training steps, and can further exceed
its accuracy by a substantial margin. Using an ensemble of such
networks trained with Batch Normalization, we achieve the top-5 error
rate that improves upon the best known results on ImageNet
classification.


2 Towards Reducing Internal Covariate Shift

We define Internal Covariate Shift as the change in the distribution of
network activations due to the change in network parameters during
training. To improve the training, we seek to reduce the internal
covariate shift. By fixing the distribution of the layer inputs $x$ as
the training progresses, we expect to improve the training speed. It
has been long known (LeCun et al., 1998b; Wiesler & Ney, 2011) that the
network training converges faster if its inputs are whitened - i.e.,
linearly transformed to have zero means and unit variances, and
decorrelated. As each layer observes the inputs produced by the layers
below, it would be advantageous to achieve the same whitening of the
inputs of each layer. By whitening the inputs to each layer, we would
take a step towards achieving the fixed distributions of inputs that
would remove the ill effects of the internal covariate shift.
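
As a concrete picture of what whitening means, the following is a minimal sketch, assuming a symmetric (ZCA-style) inverse square root of the covariance computed by eigendecomposition. The function name `whiten`, the small `eps` guard, and the synthetic data are hypothetical choices for illustration and are not taken from the paper.

```python
import numpy as np

def whiten(X, eps=1e-12):
    """Return data with zero mean and (approximately) identity covariance.

    X has one example per row. The symmetric inverse square root of the
    covariance is one possible whitening transform, chosen here for simplicity.
    """
    Xc = X - X.mean(axis=0)                      # zero means
    cov = Xc.T @ Xc / X.shape[0]                 # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)       # cov is symmetric
    inv_sqrt = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return Xc @ inv_sqrt                         # decorrelate and rescale

rng = np.random.default_rng(0)
A = np.array([[2.0, 0.5, 0.0],
              [0.0, 1.0, 0.3],
              [0.0, 0.0, 0.5]])                  # mixes the features
X = rng.normal(size=(5000, 3)) @ A + np.array([1.0, -2.0, 0.5])
Xw = whiten(X)

print(np.round(Xw.mean(axis=0), 6))
print(np.round(np.cov(Xw.T, bias=True), 6))
```

Note that the transform needs the full covariance matrix and its eigendecomposition, which is the cost that becomes relevant when whitening is considered for every layer.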


We could consider whitening activations at every training step or at
some interval, either by modifying the network directly or by changing
the parameters of the optimization algorithm to depend on the network
activation values (Wiesler et al., 2014; Raiko et al., 2012;
Povey et al., 2014; Desjardins & Kavukcuoglu). However, if these
modifications are interspersed with the optimization steps, then the
gradient descent step may attempt to update the parameters in a way
that requires the normalization to be updated, which reduces the effect
of the gradient step. For example, consider a layer with the input $u$
that adds the learned bias $b$, and normalizes the result by
subtracting the mean of the activation computed over the training data:
$\hat{x} = x - E[x]$ where $x = u + b$,
$\mathcal{X} = \{x_{1 \ldots N}\}$ is the set of values of $x$ over the
training set, and $E[x] = \frac{1}{N} \sum_{i=1}^{N} x_i$. If a
gradient descent step ignores the dependence of $E[x]$ on $b$, then it
will update $b \leftarrow b + \Delta b$, where
$\Delta b \propto -\partial \ell / \partial \hat{x}$. Then
$u + (b + \Delta b) - E[u + (b + \Delta b)] = u + b - E[u + b]$.
Thus, the combination of the update to $b$ and subsequent change in
normalization led to no change in the output of the layer nor,
consequently, the loss. As the training continues, $b$ will grow
indefinitely while the loss remains fixed. This problem can get worse
if the normalization not only centers but also scales the activations.
We have observed this empirically in initial experiments, where the
model blows up when the normalization parameters are computed outside
the gradient descent step.
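
The bias example above can be checked numerically. This is a small sketch with synthetic inputs; `centered_output` is a hypothetical helper name. It shows that a layer which adds a bias and then subtracts the mean produces the same output for any value of the bias, which is why an update to $b$ that ignores the normalization cannot change the loss.

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.normal(size=1000)          # layer inputs over a synthetic "training set"

def centered_output(u, b):
    """x = u + b followed by subtraction of the mean of x over the data."""
    x = u + b
    return x - x.mean()

out_before = centered_output(u, b=0.0)
out_after = centered_output(u, b=123.4)    # bias after many hypothetical updates

# The bias cancels in the mean subtraction, so the layer output is unchanged.
print(np.allclose(out_before, out_after))
```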


The issue with the above approach is that the gradient 
descent optimization does not take into account the fact 
that the normalization takes place. To address this issue, 
we would like to ensure that, for any parameter values, 
the network always produces activations with the desired 
distribution. Doing so would allow the gradient of the 
loss with respect to the model parameters to account for 
the normalization, and for its dependence on the model 
parameters $\Theta$. Let again $x$ be a layer input, treated as a
vector, and $\mathcal{X}$ be the set of these inputs over the training
data set. The normalization can then be written as a transformation

$$\hat{x} = \mathrm{Norm}(x, \mathcal{X})$$


which depends not only on the given training example $x$ but on all
examples $\mathcal{X}$ - each of which depends on $\Theta$ if $x$ is
generated by another layer. For backpropagation, we would need to
compute the Jacobians

$$\frac{\partial \mathrm{Norm}(x, \mathcal{X})}{\partial x} \quad \text{and} \quad \frac{\partial \mathrm{Norm}(x, \mathcal{X})}{\partial \mathcal{X}};$$

ignoring the latter term would lead to the explosion described above.
Within this framework, whitening the layer inputs is expensive, as it
requires computing the covariance matrix
$\mathrm{Cov}[x] = E_{x \in \mathcal{X}}[x x^T] - E[x] E[x]^T$ and its
inverse square root, to produce the whitened activations
$\mathrm{Cov}[x]^{-1/2} (x - E[x])$, as well as the derivatives of these
transforms for backpropagation. This motivates us to seek an
alternative that performs input normalization in a way that is
differentiable and does not require the analysis of the entire training
set after every parameter update.

Some of the previous approaches (e.g. (Lyu & Simoncelli, 2008)) use
statistics computed over a single training example, or, in the case of
image networks, over different feature maps at a given location.
However, this changes the representation ability of a network by
discarding the absolute scale of activations. We want to preserve the
information in the network, by normalizing the activations in a
training example relative to the statistics of the entire training
data.


3 Normalization via Mini-Batch Statistics

Since the full whitening of each layer’s inputs is costly and not
everywhere differentiable, we make two necessary simplifications. The
first is that instead of whitening the features in layer inputs and
outputs jointly, we will normalize each scalar feature independently,
by making it have the mean of zero and the variance of 1. For a layer
with $d$-dimensional input $x = (x^{(1)} \ldots x^{(d)})$, we will
normalize each dimension

$$\hat{x}^{(k)} = \frac{x^{(k)} - E[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]}}$$

where the expectation and variance are computed over the training data
set. As shown in (LeCun et al., 1998b), such normalization speeds up
convergence, even when the features are not decorrelated.

Note that simply normalizing each input of a layer may change what the
layer can represent. For instance, normalizing the inputs of a sigmoid
would constrain them to the linear regime of the nonlinearity. To
address this, we make sure that the transformation inserted in the
network can represent the identity transform. To accomplish this, we
introduce, for each activation $x^{(k)}$, a pair of parameters
$\gamma^{(k)}, \beta^{(k)}$, which scale and shift the normalized value:

$$y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}.$$

These parameters are learned along with the original model parameters,
and restore the representation power of the network. Indeed, by setting
$\gamma^{(k)} = \sqrt{\mathrm{Var}[x^{(k)}]}$ and
$\beta^{(k)} = E[x^{(k)}]$, we could recover the original activations,
if that were the optimal thing to do.
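
The per-dimension normalization and the identity-recovering choice of $\gamma^{(k)}$ and $\beta^{(k)}$ can be sketched as follows. The synthetic data and variable names are assumptions for illustration; the statistics here are computed over the whole (toy) data set, as in the definition above.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two features with different means and scales (synthetic, for illustration).
X = rng.normal(loc=[3.0, -1.0], scale=[2.0, 0.5], size=(1000, 2))

mean = X.mean(axis=0)                  # E[x^(k)] for each dimension k
var = X.var(axis=0)                    # Var[x^(k)] for each dimension k
X_hat = (X - mean) / np.sqrt(var)      # each feature: mean 0, variance 1

# Setting gamma = sqrt(Var[x]) and beta = E[x] undoes the normalization.
gamma = np.sqrt(var)
beta = mean
Y = gamma * X_hat + beta

print(np.allclose(Y, X))
```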

In the batch setting where each training step is based on the entire
training set, we would use the whole set to normalize activations.
However, this is impractical when using stochastic optimization.
Therefore, we make the second simplification: since we use mini-batches
in stochastic gradient training, each mini-batch produces estimates of
the mean and variance of each activation. This way, the statistics used
for normalization can fully participate in the gradient
backpropagation. Note that the use of mini-batches is enabled by
computation of per-dimension variances rather than joint covariances;
in the joint case, regularization would be required since the
mini-batch size is likely to be smaller than the number of activations
being whitened, resulting in singular covariance matrices.

Consider a mini-batch $\mathcal{B}$ of size $m$. Since the
normalization is applied to each activation independently, let us focus
on a particular activation $x^{(k)}$ and omit $k$ for clarity. We have
$m$ values of this activation in the mini-batch,

$$\mathcal{B} = \{x_{1 \ldots m}\}.$$

Let the normalized values be $\hat{x}_{1 \ldots m}$, and their linear
transformations be $y_{1 \ldots m}$. We refer to the transform

$$\mathrm{BN}_{\gamma,\beta} : x_{1 \ldots m} \rightarrow y_{1 \ldots m}$$

as the Batch Normalizing Transform. We present the BN Transform in
Algorithm 1. In the algorithm, $\epsilon$ is a constant added to the
mini-batch variance for numerical stability.


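Input: Values of $x$ over a mini-batch: $\mathcal{B} = \{x_{1 \ldots m}\}$;
       Parameters to be learned: $\gamma$, $\beta$
Output: $\{y_i = \mathrm{BN}_{\gamma,\beta}(x_i)\}$

  $\mu_{\mathcal{B}} \leftarrow \frac{1}{m} \sum_{i=1}^{m} x_i$                                        // mini-batch mean
  $\sigma_{\mathcal{B}}^2 \leftarrow \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_{\mathcal{B}})^2$           // mini-batch variance
  $\hat{x}_i \leftarrow \frac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}}$      // normalize
  $y_i \leftarrow \gamma \hat{x}_i + \beta \equiv \mathrm{BN}_{\gamma,\beta}(x_i)$                     // scale and shift

Algorithm 1: Batch Normalizing Transform, applied to activation $x$
over a mini-batch.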
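
The steps of Algorithm 1 map directly onto array operations. The following is a minimal NumPy sketch, not the authors' implementation: the function name `batchnorm_forward`, the default `eps=1e-5`, and the layout with one example per row are assumptions made for illustration.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Batch Normalizing Transform for a mini-batch x of shape (m, d).

    Each of the d activations is normalized independently over the m examples.
    """
    mu = x.mean(axis=0)                      # mini-batch mean
    var = x.var(axis=0)                      # mini-batch variance (divides by m)
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize
    y = gamma * x_hat + beta                 # scale and shift
    return y, x_hat

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))     # synthetic mini-batch
y, x_hat = batchnorm_forward(x, gamma=np.ones(4), beta=np.zeros(4))

# With gamma = 1 and beta = 0 the output has mean ~0 and variance ~1
# (slightly below 1 because of eps).
print(np.round(y.mean(axis=0), 6), np.round(y.var(axis=0), 4))
```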


The BN transform can be added to a network to manipulate any
activation. In the notation $y = \mathrm{BN}_{\gamma,\beta}(x)$, we
indicate that the parameters $\gamma$ and $\beta$ are to be learned,
but it should be noted that the BN transform does not independently
process the activation in each training example. Rather,
$\mathrm{BN}_{\gamma,\beta}(x)$ depends both on the training example
and the other examples in the mini-batch. The scaled and shifted values
$y$ are passed to other network layers. The normalized activations
$\hat{x}$ are internal to our transformation, but their presence is
crucial. The distribution of values of any $\hat{x}$ has the expected
value of 0 and the variance of 1, as long as the elements of each
mini-batch are sampled from the same distribution, and if we neglect
$\epsilon$. This can be seen by observing that
$\sum_{i=1}^{m} \hat{x}_i = 0$ and
$\frac{1}{m} \sum_{i=1}^{m} \hat{x}_i^2 = 1$, and taking expectations.
Each normalized activation $\hat{x}^{(k)}$ can be viewed as an input to
a sub-network composed of the linear transform
$y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}$, followed by the
other processing done by the original network. These sub-network inputs
all have fixed means and variances, and although the joint distribution
of these normalized $\hat{x}^{(k)}$ can change over the course of
training, we expect that the introduction of normalized inputs
accelerates the training of the sub-network and, consequently, the
network as a whole.

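During training we need to backpropagate the gradient of loss $\ell$
through this transformation, as well as compute the gradients with
respect to the parameters of the BN transform. We use chain rule, as
follows (before simplification):

$$\frac{\partial \ell}{\partial \hat{x}_i} = \frac{\partial \ell}{\partial y_i} \cdot \gamma$$

$$\frac{\partial \ell}{\partial \sigma_{\mathcal{B}}^2} = \sum_{i=1}^{m} \frac{\partial \ell}{\partial \hat{x}_i} \cdot (x_i - \mu_{\mathcal{B}}) \cdot \frac{-1}{2} (\sigma_{\mathcal{B}}^2 + \epsilon)^{-3/2}$$

$$\frac{\partial \ell}{\partial \mu_{\mathcal{B}}} = \left( \sum_{i=1}^{m} \frac{\partial \ell}{\partial \hat{x}_i} \cdot \frac{-1}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}} \right) + \frac{\partial \ell}{\partial \sigma_{\mathcal{B}}^2} \cdot \frac{\sum_{i=1}^{m} -2 (x_i - \mu_{\mathcal{B}})}{m}$$

$$\frac{\partial \ell}{\partial x_i} = \frac{\partial \ell}{\partial \hat{x}_i} \cdot \frac{1}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}} + \frac{\partial \ell}{\partial \sigma_{\mathcal{B}}^2} \cdot \frac{2 (x_i - \mu_{\mathcal{B}})}{m} + \frac{\partial \ell}{\partial \mu_{\mathcal{B}}} \cdot \frac{1}{m}$$

$$\frac{\partial \ell}{\partial \gamma} = \sum_{i=1}^{m} \frac{\partial \ell}{\partial y_i} \cdot \hat{x}_i$$

$$\frac{\partial \ell}{\partial \beta} = \sum_{i=1}^{m} \frac{\partial \ell}{\partial y_i}$$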
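
The chain-rule expressions above can be transcribed term by term and checked against finite differences. This is a sketch under stated assumptions: the function names `bn_forward` and `bn_backward`, the toy loss $\ell = \sum_i y_i w_i$ with fixed random weights `w`, and the step size `h` are all hypothetical choices for the check, not part of the paper.

```python
import numpy as np

def bn_forward(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)
    var = x.var(axis=0)                               # divides by m
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta, (x, x_hat, mu, var, gamma, eps)

def bn_backward(dy, cache):
    """Gradients of the loss w.r.t. x, gamma, beta, following the equations above."""
    x, x_hat, mu, var, gamma, eps = cache
    m = x.shape[0]
    inv_std = 1.0 / np.sqrt(var + eps)
    dx_hat = dy * gamma
    dvar = np.sum(dx_hat * (x - mu), axis=0) * (-0.5) * inv_std**3
    dmu = np.sum(-dx_hat * inv_std, axis=0) + dvar * np.sum(-2.0 * (x - mu), axis=0) / m
    dx = dx_hat * inv_std + dvar * 2.0 * (x - mu) / m + dmu / m
    dgamma = np.sum(dy * x_hat, axis=0)
    dbeta = np.sum(dy, axis=0)
    return dx, dgamma, dbeta

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3))
gamma, beta = rng.normal(size=3), rng.normal(size=3)
w = rng.normal(size=(8, 3))          # toy loss = sum(y * w), so dloss/dy = w

y, cache = bn_forward(x, gamma, beta)
dx, dgamma, dbeta = bn_backward(w, cache)

# Central finite differences of the toy loss with respect to each x entry.
h = 1e-6
dx_num = np.zeros_like(x)
for idx in np.ndindex(*x.shape):
    xp, xm = x.copy(), x.copy()
    xp[idx] += h
    xm[idx] -= h
    dx_num[idx] = (np.sum(bn_forward(xp, gamma, beta)[0] * w)
                   - np.sum(bn_forward(xm, gamma, beta)[0] * w)) / (2 * h)

print(np.max(np.abs(dx - dx_num)))   # should be close to zero
```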


Thus, BN transform is a differentiable transformation that introduces
normalized activations into the network. This ensures that as the model
is training, layers can continue learning on input distributions that
exhibit less internal covariate shift, thus accelerating the training.
Furthermore, the learned affine transform applied to these normalized
activations allows the BN transform to represent the identity
transformation and preserves the network capacity.


3.1 Training and Inference with Batch-Normalized Networks

To Batch-Normalize a network, we specify a subset of activations and
insert the BN transform for each of them, according to Alg. 1. Any
layer that previously received $x$ as the input, now receives
$\mathrm{BN}(x)$. A model employing Batch Normalization can be trained
using batch gradient descent, or Stochastic Gradient Descent with a
mini-batch size $m > 1$, or with any of its variants such as Adagrad
(Duchi et al., 2011). The normalization of activations that depends on
the mini-batch allows efficient training, but is neither necessary nor
desirable during inference; we want the output to depend only on the
input, deterministically. For this, once the network has been trained,
we use the normalization

$$\hat{x} = \frac{x - E[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}}$$

using the population, rather than mini-batch, statistics. Neglecting
$\epsilon$, these normalized activations have the same mean 0 and
variance 1 as during training. We use the unbiased variance estimate
$\mathrm{Var}[x] = \frac{m}{m-1} \cdot E_{\mathcal{B}}[\sigma_{\mathcal{B}}^2]$,
where the expectation is over training mini-batches of size $m$ and
$\sigma_{\mathcal{B}}^2$ are their sample variances. Using moving
averages instead, we can track the accuracy of a model as it trains.
Since the means and variances are fixed during inference, the
normalization is simply a linear transform applied to each activation.
It may further be composed with the scaling by $\gamma$ and shift by
$\beta$, to yield a single linear transform that replaces
$\mathrm{BN}(x)$. Algorithm 2 summarizes the procedure for training
batch-normalized networks.


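Input: Network $N$ with trainable parameters $\Theta$;
       subset of activations $\{x^{(k)}\}_{k=1}^{K}$
Output: Batch-normalized network for inference, $N_{\mathrm{BN}}^{\mathrm{inf}}$

 1: $N_{\mathrm{BN}}^{\mathrm{tr}} \leftarrow N$   // Training BN network
 2: for $k = 1 \ldots K$ do
 3:   Add transformation $y^{(k)} = \mathrm{BN}_{\gamma^{(k)},\beta^{(k)}}(x^{(k)})$ to
      $N_{\mathrm{BN}}^{\mathrm{tr}}$ (Alg. 1)
 4:   Modify each layer in $N_{\mathrm{BN}}^{\mathrm{tr}}$ with input $x^{(k)}$ to take
      $y^{(k)}$ instead
 5: end for
 6: Train $N_{\mathrm{BN}}^{\mathrm{tr}}$ to optimize the parameters
    $\Theta \cup \{\gamma^{(k)}, \beta^{(k)}\}_{k=1}^{K}$
 7: $N_{\mathrm{BN}}^{\mathrm{inf}} \leftarrow N_{\mathrm{BN}}^{\mathrm{tr}}$   // Inference BN network with frozen
                                                                                // parameters
 8: for $k = 1 \ldots K$ do
 9:   // For clarity, $x \equiv x^{(k)}$, $\gamma \equiv \gamma^{(k)}$, $\mu_{\mathcal{B}} \equiv \mu_{\mathcal{B}}^{(k)}$, etc.
10:   Process multiple training mini-batches $\mathcal{B}$, each of size $m$, and
      average over them:
        $E[x] \leftarrow E_{\mathcal{B}}[\mu_{\mathcal{B}}]$
        $\mathrm{Var}[x] \leftarrow \frac{m}{m-1} E_{\mathcal{B}}[\sigma_{\mathcal{B}}^2]$
11:   In $N_{\mathrm{BN}}^{\mathrm{inf}}$, replace the transform $y = \mathrm{BN}_{\gamma,\beta}(x)$ with
        $y = \frac{\gamma}{\sqrt{\mathrm{Var}[x] + \epsilon}} \cdot x + \left( \beta - \frac{\gamma E[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} \right)$
12: end for

Algorithm 2: Training a Batch-Normalized Network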
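
Steps 10 and 11 of Algorithm 2 can be sketched in a few lines. The helper names `population_stats` and `fold_bn`, the synthetic mini-batches, and the example values of $\gamma$ and $\beta$ are assumptions for illustration; a real training setup would collect the statistics from the trained network's activations.

```python
import numpy as np

def population_stats(batches):
    """Average mini-batch statistics over batches of equal size m (step 10)."""
    m = batches[0].shape[0]
    mean = np.mean([b.mean(axis=0) for b in batches], axis=0)
    # m / (m - 1) turns the average of biased batch variances into an unbiased estimate.
    var = m / (m - 1) * np.mean([b.var(axis=0) for b in batches], axis=0)
    return mean, var

def fold_bn(gamma, beta, mean, var, eps=1e-5):
    """Compose normalization with scale and shift into y = scale * x + shift (step 11)."""
    scale = gamma / np.sqrt(var + eps)
    shift = beta - scale * mean
    return scale, shift

rng = np.random.default_rng(0)
batches = [rng.normal(loc=2.0, scale=3.0, size=(32, 4)) for _ in range(100)]
gamma, beta = np.full(4, 1.5), np.full(4, -0.5)     # stand-ins for learned values

mean, var = population_stats(batches)
scale, shift = fold_bn(gamma, beta, mean, var)

x_new = rng.normal(loc=2.0, scale=3.0, size=(5, 4))  # inputs seen at inference
y_folded = scale * x_new + shift
y_reference = gamma * (x_new - mean) / np.sqrt(var + 1e-5) + beta
print(np.allclose(y_folded, y_reference))
```

Because the folded form is a fixed per-activation linear map, the output for a given input no longer depends on which other examples are processed with it.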


3.2 Batch-Normalized Convolutional Networks

Batch Normalization can be applied to any set of activations in the
network. Here, we focus on transforms that consist of an affine
transformation followed by an element-wise nonlinearity:

$$z = g(Wu + b)$$


where $W$ and $b$ are learned parameters of the model, and $g(\cdot)$
is the nonlinearity such as sigmoid or ReLU. This formulation covers
both fully-connected and convolutional layers. We add the BN transform
immediately before the nonlinearity, by normalizing $x = Wu + b$. We
could have also normalized the layer inputs $u$, but since $u$ is
likely the output of another nonlinearity, the shape of its
distribution is likely to change during training, and constraining its
first and second moments would not eliminate the covariate shift. In
contrast, $Wu + b$ is more likely to have a symmetric, non-sparse
distribution, that is “more Gaussian” (Hyvärinen & Oja, 2000);
normalizing it is likely to produce activations with a stable
distribution.

Note that, since we normalize $Wu + b$, the bias $b$ can be ignored
since its effect will be canceled by the subsequent mean subtraction
(the role of the bias is subsumed by $\beta$ in Alg. 1). Thus,
$z = g(Wu + b)$ is replaced with

$$z = g(\mathrm{BN}(Wu))$$

where the BN transform is applied independently to each dimension of
$x = Wu$, with a separate pair of learned parameters $\gamma^{(k)}$,
$\beta^{(k)}$ per dimension.

For convolutional layers, we additionally want the normalization to
obey the convolutional property - so that different elements of the
same feature map, at different locations, are normalized in the same
way. To achieve this, we jointly normalize all the activations in a
mini-batch, over all locations. In Alg. 1, we let $\mathcal{B}$ be the
set of all values in a feature map across both the elements of a
mini-batch and spatial locations - so for a mini-batch of size $m$ and
feature maps of size $p \times q$, we use the effective mini-batch of
size $m' = |\mathcal{B}| = m \cdot pq$. We learn a pair of parameters
$\gamma^{(k)}$ and $\beta^{(k)}$ per feature map, rather than per
activation. Alg. 2 is modified similarly, so that during inference the
BN transform applies the same linear transformation to each activation
in a given feature map.


3.3 Batch Normalization enables higher learning rates

In traditional deep networks, a too-high learning rate may result in
gradients that explode or vanish, as well as getting stuck in poor
local minima. Batch Normalization helps address these issues. By
normalizing activations throughout the network, it prevents small
changes to the parameters from amplifying into larger and suboptimal
changes in activations and gradients; for instance, it prevents the
training from getting stuck in the saturated regimes of nonlinearities.

Batch Normalization also makes training more resilient to the parameter
scale. Normally, large learning rates may increase the scale of layer
parameters, which then amplify the gradient during backpropagation and
lead to the model explosion. However, with Batch Normalization,
backpropagation through a layer is unaffected by the scale of its
parameters. Indeed, for a scalar $a$,

$$\mathrm{BN}(Wu) = \mathrm{BN}((aW)u)$$

and we can show that

$$\frac{\partial \mathrm{BN}((aW)u)}{\partial u} = \frac{\partial \mathrm{BN}(Wu)}{\partial u}$$

$$\frac{\partial \mathrm{BN}((aW)u)}{\partial (aW)} = \frac{1}{a} \cdot \frac{\partial \mathrm{BN}(Wu)}{\partial W}$$

The scale does not affect the layer Jacobian nor, consequently, the
gradient propagation. Moreover, larger weights lead to smaller
gradients, and Batch Normalization will stabilize the parameter growth.
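
The scale-invariance property $\mathrm{BN}(Wu) = \mathrm{BN}((aW)u)$ is easy to check numerically. The sketch below makes several assumptions for illustration: $\gamma = 1$, $\beta = 0$, $\epsilon$ neglected, a positive scalar `a`, and a row-per-example layout so that the layer computes `U @ W` rather than $Wu$.

```python
import numpy as np

def bn(x, eps=0.0):
    """Normalization part of the BN transform over the batch axis (gamma=1, beta=0)."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

rng = np.random.default_rng(0)
U = rng.normal(size=(32, 5))     # mini-batch of layer inputs, one example per row
W = rng.normal(size=(5, 4))      # layer weights (transposed convention)
a = 10.0                         # positive rescaling of the weights

out = bn(U @ W)
out_scaled = bn(U @ (a * W))

# Rescaling W rescales both the centered activations and their standard
# deviation by a, so the normalized output is unchanged.
print(np.allclose(out, out_scaled))
```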

We further conjecture that Batch Normalization may lead the layer
Jacobians to have singular values close to 1, which is known to be
beneficial for training (Saxe et al., 2013). Consider two consecutive
layers with normalized inputs, and the transformation between these
normalized vectors: $\hat{z} = F(\hat{x})$. If we assume that $\hat{x}$
and $\hat{z}$ are Gaussian and uncorrelated, and that
$F(\hat{x}) \approx J\hat{x}$ is a linear transformation for the given
model parameters, then both $\hat{x}$ and $\hat{z}$ have unit
covariances, and
$I = \mathrm{Cov}[\hat{z}] = J \mathrm{Cov}[\hat{x}] J^T = J J^T$.
Thus, $J J^T = I$, and so all singular values of $J$ are equal to 1,
which preserves the gradient magnitudes during backpropagation. In
reality, the transformation is not linear, and the normalized values
are not guaranteed to be Gaussian nor independent, but we nevertheless
expect Batch Normalization to help make gradient propagation better
behaved. The precise effect of Batch Normalization on gradient
propagation remains an area of further study.


3.4 Batch Normalization regularizes the model

When training with Batch Normalization, a training example is seen in
conjunction with other examples in the mini-batch, and the training
network no longer produces deterministic values for a given training
example. In our experiments, we found this effect to be advantageous to
the generalization of the network. Whereas Dropout
(Srivastava et al., 2014) is typically used to reduce overfitting, in a
batch-normalized network we found that it can be either removed or
reduced in strength.
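
The dependence of a training example's output on the rest of its mini-batch can be seen in a short sketch. The synthetic data and the helper `bn` (with $\gamma = 1$, $\beta = 0$) are assumptions for illustration only.

```python
import numpy as np

def bn(x, eps=1e-5):
    """Training-mode normalization over the batch axis (gamma=1, beta=0)."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

rng = np.random.default_rng(0)
example = rng.normal(size=(1, 4))                        # one fixed training example

# The same example placed in two different mini-batches of size 8.
batch_a = np.vstack([example, rng.normal(size=(7, 4))])
batch_b = np.vstack([example, rng.normal(size=(7, 4))])

out_a = bn(batch_a)[0]
out_b = bn(batch_b)[0]

# The normalized values differ because the batch statistics differ.
print(out_a)
print(out_b)
```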


4 Experiments 

4.1 Activations over time 


To verify the effects of internal covariate shift on training, and the
ability of Batch Normalization to combat it, we considered the problem
of predicting the digit class on the MNIST dataset
(LeCun et al., 1998a). We used a very simple network, with a 28x28
binary image as input, and 3 fully-connected hidden layers with 100
activations each. Each hidden layer computes $y = g(Wu + b)$ with
sigmoid nonlinearity, and the weights $W$ initialized to small random
Gaussian values. The last hidden layer is followed by a fully-connected
layer with 10 activations (one per class) and cross-entropy loss. We
trained the network for 50000 steps, with 60 examples per mini-batch.
We added Batch Normalization to each hidden layer of the network, as in
Sec. 3.1. We were interested in the comparison between the baseline and
batch-normalized networks, rather than achieving the state of the art
performance on MNIST (which the described architecture does not).

[Figure 1 panels: (a); (b) Without BN; (c) With BN]

Figure 1: (a) The test accuracy of the MNIST network trained with and
without Batch Normalization, vs. the number of training steps. Batch
Normalization helps the network train faster and achieve higher
accuracy. (b, c) The evolution of input distributions to a typical
sigmoid, over the course of training, shown as {15, 50, 85}th
percentiles. Batch Normalization makes the distribution more stable and
reduces the internal covariate shift.

Figure 1(a) shows the fraction of correct predictions by the two
networks on held-out test data, as training progresses. The
batch-normalized network enjoys the higher test accuracy. To
investigate why, we studied inputs to the sigmoid, in the original
network $N$ and batch-normalized network
$N_{\mathrm{BN}}^{\mathrm{tr}}$ (Alg. 2) over the course of training.
In Fig. 1(b, c) we show, for one typical activation from the last
hidden layer of each network, how its distribution evolves. The
distributions in the original network change significantly over time,
both in their mean and the variance, which complicates the training of
the subsequent layers. In contrast, the distributions in the
batch-normalized network are much more stable as training progresses,
which aids the training.


4.2 ImageNet classification

We applied Batch Normalization to a new variant of the Inception
network (Szegedy et al., 2014), trained on the ImageNet classification
task (Russakovsky et al., 2014). The network has a large number of
convolutional and pooling layers, with a softmax layer to predict the
image class, out of 1000 possibilities. Convolutional layers use ReLU
as the nonlinearity. The main difference to the network described in
(Szegedy et al., 2014) is that the $5 \times 5$ convolutional layers
are replaced by two consecutive layers of $3 \times 3$ convolutions
with up to 128 filters. The network contains $13.6 \cdot 10^6$
parameters, and, other than the top softmax layer, has no
fully-connected layers. More details are given in the Appendix. We
refer to this model as Inception in the rest of the text. The model was
trained using a version of Stochastic Gradient Descent with momentum
(Sutskever et al., 2013), using the mini-batch size of 32. The training
was performed using a large-scale, distributed architecture (similar to
(Dean et al., 2012)). All networks are evaluated as training progresses
by computing the validation accuracy @1, i.e. the probability of
predicting the correct label out of 1000 possibilities, on a held-out
set, using a single crop per image.

In our experiments, we evaluated several modifications of Inception
with Batch Normalization. In all cases, Batch Normalization was applied
to the input of each nonlinearity, in a convolutional way, as described
in section 3.2, while keeping the rest of the architecture constant.

4.2.1 Accelerating BN Networks 


Simply adding Batch Normalization to a network does not take full
advantage of our method. To do so, we further changed the network and
its training parameters, as follows:

Increase learning rate. In a batch-normalized model, we have been able
to achieve a training speedup from higher learning rates, with no ill
side effects (Sec. 3.3).

Remove Dropout. As described in Sec. 3.4, Batch Normalization fulfills
some of the same goals as Dropout. Removing Dropout from Modified
BN-Inception speeds up training, without increasing overfitting.

Reduce the $L_2$ weight regularization. While in Inception an $L_2$
loss on the model parameters controls overfitting, in Modified
BN-Inception the weight of this loss is reduced by a factor of 5. We
find that this improves the accuracy on the held-out validation data.

Accelerate the learning rate decay. In training Inception, learning
rate was decayed exponentially. Because our network trains faster than
Inception, we lower the learning rate 6 times faster.

Remove Local Response Normalization. While Inception and other networks
(Srivastava et al., 2014) benefit from it, we found that with Batch
Normalization it is not necessary.

Shuffle training examples more thoroughly. We enabled within-shard
shuffling of the training data, which prevents the same examples from
always appearing in a mini-batch together. This led to about 1%
improvements in the validation accuracy, which is consistent with the
view of Batch Normalization as a regularizer (Sec. 3.4): the
randomization inherent in our method should be most beneficial when it
affects an example differently each time it is seen.

Reduce the photometric distortions. Because batch-normalized networks
train faster and observe each training example fewer times, we let the
trainer focus on more “real” images by distorting them less.
Figure 2: Single crop validation accuracy of Inception and its
batch-normalized variants, vs. the number of training steps.


Model            Steps to 72.2%       Max accuracy
Inception        $31.0 \cdot 10^6$    72.2%
BN-Baseline      $13.3 \cdot 10^6$    72.7%
BN-x5            $2.1 \cdot 10^6$     73.0%
BN-x30           $2.7 \cdot 10^6$     74.8%
BN-x5-Sigmoid                         69.8%

Figure 3: For Inception and the batch-normalized variants, the number
of training steps required to reach the maximum accuracy of Inception
(72.2%), and the maximum accuracy achieved by the network.


4.2.2 Single-Network Classification 

We evaluated the following networks, all trained on the 
LSVRC2012 training data, and tested on the validation 
data: 

Inception: the network described at the beginning of 
Section l4~2l trained with the initial learning rate of 0.0015. 

BN-Baseline: Same as Inception with Batch Normal¬ 
ization before each nonlinearity. 

BN-x5: Inception with Batch Normalization and the
modifications in Sec. 4.2.1. The initial learning rate was
increased by a factor of 5, to 0.0075. The same learning
rate increase with original Inception caused the model
parameters to reach machine infinity.

BN-x30: Like BN-x5, but with the initial learning rate
0.045 (30 times that of Inception).

BN-x5-Sigmoid: Like BN-x5, but with sigmoid nonlinearity
$g(t) = \frac{1}{1+\exp(-t)}$ instead of ReLU. We also
attempted to train the original Inception with sigmoid, but
the model remained at the accuracy equivalent to chance.
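
The variants above differ mainly in the initial learning rate and the nonlinearity. The sketch below records these as plain Python values. The dictionary layout and function names are illustrative only and are not taken from the actual training code.

```python
import math

BASE_LR = 0.0015  # initial learning rate of Inception, from the text

# Initial learning rates implied by the multipliers in the variant names.
initial_lr = {
    "Inception": BASE_LR,
    "BN-x5": 5 * BASE_LR,
    "BN-x30": 30 * BASE_LR,
}


def sigmoid(t):
    """Saturating nonlinearity g(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + math.exp(-t))


def relu(t):
    """Non-saturating nonlinearity used by the other variants."""
    return max(0.0, t)
```

The sigmoid flattens for large |t|, which is what makes such networks hard to train without normalization. ReLU does not saturate for positive inputs.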

In Figure 2 we show the validation accuracy of the
networks, as a function of the number of training steps.
Inception reached the accuracy of 72.2% after 31 · 10^6
training steps. Figure 3 shows, for each network,
the number of training steps required to reach the same
72.2% accuracy, as well as the maximum validation
accuracy reached by the network and the number of steps to
reach it.

By only using Batch Normalization (BN-Baseline), we
match the accuracy of Inception in less than half the
number of training steps. By applying the modifications in
Sec. 4.2.1, we significantly increase the training speed of
the network. BN-x5 needs 14 times fewer steps than
Inception to reach the 72.2% accuracy. Interestingly,
increasing the learning rate further (BN-x30) causes the
model to train somewhat slower initially, but allows it to
reach a higher final accuracy. It reaches 74.8% after 6 · 10^6
steps, i.e. 5 times fewer steps than required by Inception
to reach 72.2%.
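
The speedup factors quoted above follow directly from the step counts in Figure 3. The short check below recomputes them; the variable names are illustrative.

```python
INCEPTION_STEPS = 31.0e6  # steps for Inception to reach 72.2%

steps_to_match = {  # steps to reach 72.2%, from Figure 3
    "BN-Baseline": 13.3e6,
    "BN-x5": 2.1e6,
    "BN-x30": 2.7e6,
}

# How many times fewer steps each variant needs than Inception.
speedup = {name: INCEPTION_STEPS / steps for name, steps in steps_to_match.items()}

# BN-x30 reaches its maximum of 74.8% after about 6e6 steps.
bn_x30_peak_ratio = INCEPTION_STEPS / 6.0e6
```

The ratios come out to roughly 2.3 for BN-Baseline, 14.8 for BN-x5, and 11.5 for BN-x30, and the ratio for BN-x30 at its peak accuracy is about 5.2. These agree with the "less than half", "14 times", and "5 times" figures in the text.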

We also verified that the reduction in internal covariate
shift allows deep networks with Batch Normalization
to be trained when sigmoid is used as the nonlinearity,
despite the well-known difficulty of training such net¬ 
works. Indeed, BN-x5-Sigmoid achieves the accuracy of 
69.8%. Without Batch Normalization, Inception with sig¬ 
moid never achieves better than 1/1000 accuracy. 


4.2.3 Ensemble Classification 


The current reported best results on the ImageNet Large
Scale Visual Recognition Competition are reached by the
Deep Image ensemble of traditional models (Wu et al.,
2015) and the ensemble model of (He et al., 2015). The
latter reports the top-5 error of 4.94%, as evaluated by the
ILSVRC server. Here we report a top-5 validation error of
4.9%, and test error of 4.82% (according to the ILSVRC
server). This improves upon the previous best result, and
exceeds the estimated accuracy of human raters according
to (Russakovsky et al., 2014).


For our ensemble, we used 6 networks. Each was based 
on BN-x30, modified via some of the following: increased 
initial weights in the convolutional layers; using Dropout 
(with the Dropout probability of 5% or 10%, vs. 40% 
for the original Inception); and using non-convolutional, 
per-activation Batch Normalization with last hidden
layers of the model. Each network achieved its maximum
accuracy after about 6 · 10^6 training steps. The ensemble
prediction was based on the arithmetic average of class
probabilities predicted by the constituent networks. The
details of ensemble and multicrop inference are similar to
(Szegedy et al., 2014).
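
The ensemble rule described above, an arithmetic average of per-class probabilities, can be sketched as follows. The probabilities and the four-class setup are hypothetical toy values, and the helper names are illustrative. Multicrop inference is not modeled here.

```python
def ensemble_predict(per_model_probs):
    """Arithmetic mean of class probabilities across constituent networks."""
    num_models = len(per_model_probs)
    num_classes = len(per_model_probs[0])
    return [
        sum(probs[c] for probs in per_model_probs) / num_models
        for c in range(num_classes)
    ]


def top_k(probs, k):
    """Indices of the k most probable classes, most probable first."""
    return sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)[:k]


# Hypothetical outputs of three networks on a four-class problem.
probs = [
    [0.70, 0.10, 0.10, 0.10],
    [0.20, 0.60, 0.10, 0.10],
    [0.50, 0.30, 0.10, 0.10],
]
averaged = ensemble_predict(probs)
```

A top-5 prediction counts as correct when the true label appears in `top_k(averaged, 5)`.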


We demonstrate in Fig. 4 that batch normalization
allows us to set a new state of the art by a healthy margin on
the ImageNet classification challenge benchmarks.


Model                     Resolution  Crops  Models  Top-1 error  Top-5 error
GoogLeNet ensemble        224         144    7       -            6.67%
Deep Image low-res        256         -      1       -            7.96%
Deep Image high-res       512         -      1       24.88%       7.42%
Deep Image ensemble       variable    -      -       -            5.98%
BN-Inception single crop  224         1      1       25.2%        7.82%
BN-Inception multicrop    224         144    1       21.99%       5.82%
BN-Inception ensemble     224         144    6       20.1%        4.9%*

Figure 4: Batch-Normalized Inception comparison with previous state of the art on the provided validation set
comprising 50000 images. *BN-Inception ensemble has reached 4.82% top-5 error on the 100000 images of the test set of
the ImageNet as reported by the test server.


5 Conclusion

We have presented a novel mechanism for dramatically
accelerating the training of deep networks. It is based on
the premise that covariate shift, which is known to
complicate the training of machine learning systems, also
applies to sub-networks and layers, and removing it from
internal activations of the network may aid in training.
Our proposed method draws its power from normalizing 
activations, and from incorporating this normalization in 
the network architecture itself. This ensures that the nor¬ 
malization is appropriately handled by any optimization 
method that is being used to train the network. To en¬ 
able stochastic optimization methods commonly used in 
deep network training, we perform the normalization for 
each mini-batch, and backpropagate the gradients through 
the normalization parameters. Batch Normalization adds 
only two extra parameters per activation, and in doing so 
preserves the representation ability of the network. We 
presented an algorithm for constructing, training, and per¬ 
forming inference with batch-normalized networks. The 
resulting networks can be trained with saturating nonlin¬ 
earities, are more tolerant to increased training rates, and 
often do not require Dropout for regularization. 
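
To make the summary concrete, the sketch below applies the training-time transform to a single activation over one mini-batch, with its two learned parameters (a scale `gamma` and a shift `beta`). It is a minimal pure-Python illustration and not the implementation used in the experiments. The value of `eps` and the sample batch are assumptions.

```python
import math


def batch_norm(values, gamma, beta, eps=1e-5):
    """Normalize one activation over a mini-batch, then scale and shift."""
    m = len(values)
    mean = sum(values) / m
    var = sum((v - mean) ** 2 for v in values) / m  # biased mini-batch variance
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


batch = [1.0, 2.0, 4.0, 7.0]  # hypothetical pre-activation values

# gamma = 1, beta = 0 gives zero mean and (nearly) unit variance.
normalized = batch_norm(batch, gamma=1.0, beta=0.0)

# gamma = sqrt(var + eps), beta = mean recovers the original activations.
mean = sum(batch) / len(batch)
var = sum((v - mean) ** 2 for v in batch) / len(batch)
restored = batch_norm(batch, gamma=math.sqrt(var + 1e-5), beta=mean)
```

The second call shows why the two extra parameters preserve the representation ability of the network: a suitable scale and shift can undo the normalization and recover the identity transform.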

Merely adding Batch Normalization to a state-of-the- 
art image classification model yields a substantial speedup 
in training. By further increasing the learning rates, re¬ 
moving Dropout, and applying other modifications af¬ 
forded by Batch Normalization, we reach the previous 
state of the art with only a small fraction of training steps 
- and then beat the state of the art in single-network image 
classification. Furthermore, by combining multiple mod¬ 
els trained with Batch Normalization, we perform better 
than the best known system on ImageNet, by a significant 
margin. 


Interestingly, our method bears similarity to the
standardization layer of (Gülçehre & Bengio, 2013), though
the two methods stem from very different goals, and per¬ 
form different tasks. The goal of Batch Normalization 
is to achieve a stable distribution of activation values 
throughout training, and in our experiments we apply it 
before the nonlinearity since that is where matching the 
first and second moments is more likely to result in a 
stable distribution. On the contrary, (Gülçehre & Bengio,
2013) apply the standardization layer to the output of the
nonlinearity, which results in sparser activations. In our
large-scale image classification experiments, we have not
observed the nonlinearity inputs to be sparse, neither with
nor without Batch Normalization. Other notable
differentiating characteristics of Batch Normalization include
the learned scale and shift that allow the BN transform 
to represent identity (the standardization layer did not re¬ 
quire this since it was followed by the learned linear trans¬ 
form that, conceptually, absorbs the necessary scale and 
shift), handling of convolutional layers, deterministic in¬ 
ference that does not depend on the mini-batch, and batch- 
normalizing each convolutional layer in the network. 

In this work, we have not explored the full range of 
possibilities that Batch Normalization potentially enables. 
Our future work includes applications of our method to
Recurrent Neural Networks (Pascanu et al., 2013), where
the internal covariate shift and the vanishing or exploding
gradients may be especially severe, and which would
allow us to more thoroughly test the hypothesis that
normalization improves gradient propagation (Sec. 3.3). We plan
to investigate whether Batch Normalization can help with
domain adaptation, in its traditional sense - i.e. whether
the normalization performed by the network would
allow it to more easily generalize to new data
distributions, perhaps with just a recomputation of the population
means and variances (Alg. 2). Finally, we believe that
further theoretical analysis of the algorithm would allow still
more improvements and applications.
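
The recomputation of population statistics mentioned above can be sketched as follows. The sketch assumes the inference procedure in which mini-batch means are averaged, mini-batch variances are averaged with the unbiased m/(m-1) correction, and the transform becomes a fixed linear function. The sample activations, `eps`, and the function names are hypothetical, and whether this helps domain adaptation is exactly the open question raised in the text.

```python
import math


def population_statistics(batches):
    """Average mini-batch means and unbiased variances over several batches."""
    means, variances = [], []
    for batch in batches:
        m = len(batch)
        mu = sum(batch) / m
        sigma2 = sum((v - mu) ** 2 for v in batch) / m
        means.append(mu)
        variances.append(sigma2 * m / (m - 1))  # unbiased estimate
    return sum(means) / len(means), sum(variances) / len(variances)


def batch_norm_inference(x, gamma, beta, pop_mean, pop_var, eps=1e-5):
    """Deterministic BN transform: a fixed linear function of x."""
    scale = gamma / math.sqrt(pop_var + eps)
    return scale * x + (beta - scale * pop_mean)


# Hypothetical activations of one unit on data from a new distribution.
new_domain_batches = [[10.0, 12.0, 14.0], [11.0, 13.0, 15.0]]
pop_mean, pop_var = population_statistics(new_domain_batches)
output = batch_norm_inference(pop_mean, gamma=2.0, beta=0.5,
                              pop_mean=pop_mean, pop_var=pop_var)
```

Only `pop_mean` and `pop_var` are re-estimated from the new data. The learned `gamma` and `beta` are left unchanged.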
Appendix


Variant of the Inception Model Used 

Figure 5 documents the changes that were made to the
architecture relative to the GoogLeNet architecture.
For the interpretation of this
table, please consult (Szegedy et al., 2014). The notable
architecture changes compared to the GoogLeNet model
include:


• The 5x5 convolutional layers are replaced by two
consecutive 3x3 convolutional layers. This increases
the maximum depth of the network by 9
weight layers. It also increases the number of
parameters by 25% and the computational cost
by about 30%.

• The number of 28x28 inception modules is increased
from 2 to 3.

• Inside the modules, sometimes average, sometimes 
maximum-pooling is employed. This is indicated in 
the entries corresponding to the pooling layers of the 
table. 

• There are no across-the-board pooling layers
between any two Inception modules, but stride-2
convolution/pooling layers are employed before the
filter concatenation in the modules 3c, 4e.
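
The first change in the list above relies on two stacked 3x3 convolutions covering the same input window as one 5x5 convolution. The sketch below checks this with the standard receptive-field recurrence, assuming stride 1 and no dilation. The function name is illustrative.

```python
def receptive_field(layers):
    """Receptive field of stacked conv layers given (kernel, stride) pairs."""
    field, jump = 1, 1
    for kernel, stride in layers:
        field += (kernel - 1) * jump  # each layer widens the window
        jump *= stride  # spacing between adjacent outputs, in input pixels
    return field


single_5x5 = receptive_field([(5, 1)])
two_3x3 = receptive_field([(3, 1), (3, 1)])
```

Both configurations see a 5x5 input window, while the stacked version adds one weight layer and one extra nonlinearity per replaced layer.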

Our model employed separable convolution with depth 
multiplier 8 on the first convolutional layer. This reduces 
the computational cost while increasing the memory con¬ 
sumption at training time. 
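
The trade-off described above can be illustrated by counting weights. This is a rough sketch under stated assumptions: a 3-channel (RGB) input, the 7x7 kernel and 64 output channels listed for the first layer in Figure 5, a separable convolution made of a depthwise step followed by a 1x1 pointwise step, and no biases. The exact layer definition used in the model may differ.

```python
def standard_conv_weights(kernel, in_channels, out_channels):
    """Weights of a regular convolution."""
    return kernel * kernel * in_channels * out_channels


def separable_conv_weights(kernel, in_channels, out_channels, depth_multiplier):
    """Weights of a depthwise step followed by a 1x1 pointwise step."""
    depthwise = kernel * kernel * in_channels * depth_multiplier
    pointwise = in_channels * depth_multiplier * out_channels
    return depthwise + pointwise


standard = standard_conv_weights(7, 3, 64)       # 7*7*3*64
separable = separable_conv_weights(7, 3, 64, 8)  # 7*7*3*8 + 24*64
intermediate_channels = 3 * 8  # depthwise output channels kept for backprop
```

Under these assumptions the separable layer uses 2712 weights instead of 9408. The extra intermediate feature map of 24 channels, which must be stored for the backward pass, is one plausible source of the higher training-time memory consumption.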
type            patch size/  output      depth  #1x1  #3x3    #3x3  double #3x3  double  Pool + proj
                stride       size                     reduce        reduce       #3x3
convolution*    7x7/2        112x112x64  1
max pool        3x3/2        56x56x64    0
convolution     3x3/1        56x56x192   1            64      192
max pool        3x3/2        28x28x192   0
inception (3a)               28x28x256   3      64    64      64    64           96      avg + 32
inception (3b)               28x28x320   3      64    64      96    64           96      avg + 64
inception (3c)  stride 2     28x28x576   3      0     128     160   64           96      max + pass through
inception (4a)               14x14x576   3      224   64      96    96           128     avg + 128
inception (4b)               14x14x576   3      192   96      128   96           128     avg + 128
inception (4c)               14x14x576   3      160   128     160   128          160     avg + 128
inception (4d)               14x14x576   3      96    128     192   160          192     avg + 128
inception (4e)  stride 2     14x14x1024  3      0     128     192   192          256     max + pass through
inception (5a)               7x7x1024    3      352   192     320   160          224     avg + 128
inception (5b)               7x7x1024    3      352   192     320   192          224     max + 128
avg pool        7x7/1        1x1x1024    0


Figure 5: Inception architecture 